A novel method-based reinforcement learning with deep temporal difference network for flexible double shop scheduling problem

This paper studies the flexible double shop scheduling problem (FDSSP) that considers simultaneously job shop and assembly shop. It brings about the problem of scheduling association of the related tasks. To this end, a reinforcement learning algorithm with a deep temporal difference network is proposed to minimize the makespan. Firstly, the FDSSP is defined as the mathematical model of the flexible job-shop scheduling problem joined to the assembly constraint level. It is translated into a Markov decision process that directly selects behavioral strategies according to historical machining state data. Secondly, the proposed ten generic state features are input into the deep neural network model to fit the state value function. Similarly, eight simple constructive heuristics are used as candidate actions for scheduling decisions. From the greedy mechanism, optimally combined actions of all machines are obtained for each decision step. Finally, a deep temporal difference reinforcement learning framework is established, and a large number of comparative experiments are designed to analyze the basic performance of this algorithm. The results showed that the proposed algorithm was better than most other methods, which contributed to solving the practical production problem of the manufacturing industry.


DRL scheduling problem
In recent years, RL, one of the three types of machine learning, has been successfully applied in fields such as computing resource scheduling, robot control, and elevator scheduling. Many scholars have applied RL to production scheduling systems. Liu et al. 29 proposed a parallel algorithm that utilizes asynchronous updates and deep deterministic policy gradients to solve the job shop scheduling problem. The model is built as an MMDP, in which the state space of the JSSP environment is represented by the processing time matrix, allocation matrix, and activation matrix, and the action space is denoted by simple scheduling rules. Wei and Zhao 30 suggested the concepts of production pressure and the job's estimated mean lateness to define, respectively, the system feature and the reward or penalty policy. The Q-learning algorithm was applied to determine the composite machine rules. However, this method cannot describe the actual complex machining process. Luo et al. 31 used the PPO algorithm to select processes in a discrete action space and verified its superiority in solving flexible job shop scheduling problems. However, the PPO algorithm was not studied more thoroughly to improve its performance. Mouelhi-Chibani and Pierreval 32 proposed a neural network (NN) to dynamically select dispatching rules according to the current system status and the workshop parameters. RL can adopt a scheduling strategy that adapts to the actual system state. Song et al. 33 presented a method that uses DRL and graph neural networks (GNNs) to learn priority dispatch rules (PDRs) for FJSP. A new kind of heterogeneous graph scheduling state representation was employed to combine operation selection and machine allocation into one composite decision, which achieved high-quality learning of PDRs. Chen et al. 34 presented a rule-driven dispatching method based on data envelopment analysis to solve the multi-objective dynamic job shop scheduling problem. An agent was trained to obtain the elementary rules with the WIP fluctuation of a machine. Shahrabi et al. 35 introduced the dynamic job shop scheduling problem (DJSSP) that considered machine breakdowns and random job arrivals. In their work, the dispatching rules were based on variable neighborhood search (VNS) and compared with some common dispatching rules and the general variable neighborhood search. Wang 36 designed an improved Q-learning with a clustering and greedy search policy. A dynamic scheduling system model with multi-agent technology was built, including buffer, machine, state, and job agents, to maximize the weighted mean of the fuzzy earning. Shiue et al. 37 established a procedure in which they planned the real-time scheduling knowledge base (RTSKB) using multiple dispatching rules (MDRs). Significantly, MDRs incorporated two mechanisms: an off-online learning module and a Q-learning-based RL module. So far, these algorithms have lacked a unified scheduling problem name. Che et al. 38 , applying a deep reinforcement learning based multi-objective evolutionary algorithm, proposed a multi-objective optimization model for the scheduling problem of an oxygen production system. Yuan et al. 39 suggested a novel framework that translated a combinatorial optimization problem into a multi-stage sequential decision-making problem. This framework used a multi-agent double deep Q-network algorithm for FJSP.
This research on the application of RL in these scheduling problems (Table 1) shows that RL is an effective method to solve the scheduling problem. This algorithm has the following characteristics:
1. RL is a decision-making algorithm directly oriented to long-term goals based on state or action value.
2. RL doesn't need a complete mathematical model of the learning environment. It can imitate human experience, and learn and accumulate experience from the examples or simulation experiments that have been solved.
3. RL needs supervision and teaching. It adjusts the policy according to the evaluation reward obtained in the interaction process. So, it makes optimal responses to different system states.

Mathematical model
We introduce FDSSP by considering the production scheduling problem of the hydraulic cylinder. The process flow diagram of the hydraulic cylinder is simplified to a production scheduling model in Fig. 1. Each cylinder 40,41 is assembled from several components: body, bottom, piston, piston rod, lifting lug, O-ring, seal ring, piston pin, and wiper, as shown in Fig. 1a. The cylinder body 3 is generally made of seamless steel pipe, and its internal machining must meet high accuracy requirements. Piston 4 and piston rod 6 are connected using snap ring 2. The piston rod 6 is guided by guide sleeve 7 and sealed by seal ring 5. Cylinder bottom 1 and body 3 are provided with the oil inlet and outlet ports, respectively. When the right chamber of the hydraulic cylinder is filled with oil, the piston moves left; conversely, the piston moves right. The shop floor is divided into two areas, namely the job shop and the assembly shop. The product starts from the order and finishes with the assembly (Fig. 1b). The job shop is equipped with three machines (fine turning, CNC milling, and electric spark) (Fig. 1c). The assembly workshop has two assembly robots ( A 1 , A 2 ) (Fig. 1d). Each operation can be completed by multiple alternative machines. After each operation k is completed, job j needs to enter the quality control center for quality inspection. If the quality is acceptable, the job moves to the next operation k + 1; otherwise, it is returned to the current operation to be queued and reworked. The assembly operation is a complete kit assembly, which means that the assembly operation does not begin until all jobs are completed. Since the assembly operation may be relatively short and fixed, the planned start time of the assembly operation can be extrapolated from the delivery date of the order. In this paper, we concentrate on scheduling the job shop so that the completion time of each job is as close to the planned start time of the assembly as possible. The assembly shop is defined as the assembly constraint level. Based on the above example, the FDSSP can be described as follows. Suppose that there are n jobs to be processed in the job shop equipped with m machines. Each job $j$ ($j \in \{1, 2, \dots, N\}$), consisting of $O$ operations $k$ ($k \in \{1, 2, \dots, O\}$), needs to be processed according to the specified route. Each operation $k$ can be processed on any capable machine $m$ ($m \in \{1, 2, \dots, M_{ij}\}$) among $M_{ij}$ machines. Meanwhile, machine $m$ can process different operations $k$ of different jobs. Hence, there is a great discrepancy in the processing time of operation $k$ on different machines, which makes the study of scheduling algorithms particularly significant. The model parameters and indices are shown in Table 2.

Assembly restraint level definition
The job after assembly is referred to as the constrained job, and the job before assembly is referred to as the front job. Firstly, according to the assembly constraint relationship, the constraint levels of all jobs that have no tight front constraint are set to 1. Jobs with undefined constraint levels make up the job set U. Then, a job $J_k$ is taken from U in sequence, and the set $J_{set}$ of all its tight front jobs is formed. If the constraint levels of all jobs in $J_{set}$ have been determined, the level of job $J_k$ is set to $L(J_k) = \max(L(J_{set})) + 1$. Otherwise, $J_k$ is put back into U. This is repeated until the constraint levels of all jobs have been determined. Other assumptions are considered as follows:
1. The processing times of each operation on each machine are deterministic and known.
2. Each job can select only one process path, and one operation can only be processed by one machine at a time.
3. The sum of the start time and processing time of an operation is less than or equal to the completion time of the operation.
4. The completion time of the previous operation is less than or equal to the start time of the next operation.
5. The completion time of a product is the sum of the processing time and the assembly time.
6. The operation of each machine is cyclic.
7. The intermediate transfer time of a job from the job shop to the assembly shop is omitted.
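The level assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the `front_jobs` dictionary, and the component names in the usage note are hypothetical, and the cycle check is an added safeguard that the text does not mention.

```python
from collections import deque


def assembly_constraint_levels(front_jobs):
    """Assign an assembly constraint level to every job.

    front_jobs maps each job to the list of its tight front jobs
    (hypothetical input format chosen for this sketch).
    """
    # Jobs without a tight front constraint get level 1.
    levels = {job: 1 for job, fronts in front_jobs.items() if not fronts}
    # U: jobs whose constraint level is still undefined.
    undefined = deque(job for job in front_jobs if job not in levels)
    stalled = 0  # consecutive jobs put back into U without progress
    while undefined:
        job = undefined.popleft()
        fronts = front_jobs[job]
        if all(f in levels for f in fronts):
            # L(J_k) = max(L(J_set)) + 1
            levels[job] = max(levels[f] for f in fronts) + 1
            stalled = 0
        else:
            undefined.append(job)  # put J_k back into U
            stalled += 1
            if stalled > len(undefined):
                raise ValueError("cyclic or unknown assembly constraint")
    return levels


example_levels = assembly_constraint_levels({
    "body": [],
    "piston": [],
    "rod": [],
    "piston_assembly": ["piston", "rod"],
    "cylinder": ["body", "piston_assembly"],
})
```

In this made-up example the three raw components sit at level 1, the piston assembly at level 2, and the finished cylinder at level 3.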

Decision variables:
According to the literature reviewed 7,42-45 , makespan is the most widely studied objective. In this study, the objective of the model is to minimize the makespan $C_{max}$.

Definition of state-space
The state features reflect the main features of the production environment. The division of the state space is the basis for the reasonable selection of scheduling rules for the system. Nevertheless, owing to the constantly changing production environment, the complete system state is continuous and is often described by dozens of state characteristics of the job shop.
To describe the state space in detail, the state features are defined according to the following principles:
1. The state features can describe the main features and changes of the scheduling environment in detail, including the global features and local features of the system.
2. The states of all problems are represented by a common feature set.
3. Different scheduling problems can be represented and summarized by state features.
4. A state feature is a numerical representation of state variables.
5. The state should be easy to calculate.
The decision variable is

$$x_{jkm} = \begin{cases} 1, & \text{if operation } k \text{ of job } j \text{ is processed on machine } m, \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

The processing time of an assigned operation enters the model through the product $x_{jkm}\, t_{jkm}$. To facilitate the expression with formulas, the processing state of an operation is recorded as $P_{jk} \in \{0, 1\}$: $P_{jk} = 0$ if the operation has not been processed and $P_{jk} = 1$ if it has been processed. The operations to be processed on machine $m$ are arranged in descending order of time length, and the resulting sequence is denoted as $list(m) = \{J_{m1}, J_{m2}, \dots, J_{mv_m}\}$, where $v_m$ is the number of operations to be processed on machine $m$. As shown in Table 3, we define ten state features of the shop environment.

Definition of action space
Panwalkar and Iskander 46 , summarizing the previous studies, elaborated 113 different combinations of dispatching rules. These rules defined the useful types of problems and measures of performance. Simple constructive heuristics (SCH) are chosen to define a candidate set of behaviors for each machine, since reinforcement learning over priority assignment rules can overcome their short-sighted nature. Behaviors that are relevant or irrelevant to the conversion should be adopted to take full advantage of existing scheduling theory and the ability of the agent to learn from it. In Table 4, eight common behaviors are selected as the candidate set.
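To make the idea of dispatching rules as candidate actions concrete, the following sketch implements a few of the rules named elsewhere in the paper (FIFO, SPT, LPT, EDD, and "select no job", SNJ) as functions over a machine's waiting queue. The `Op` record, its field names, and the `apply_rule` helper are assumptions made for illustration; the paper's Table 4 defines the actual eight behaviors.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Op(NamedTuple):
    """A waiting operation (hypothetical record used only in this sketch)."""
    job: str
    proc_time: float  # processing time on this machine
    arrival: int      # order in which the operation joined the queue
    due_date: float   # due date of the job


Rule = Callable[[List[Op]], Optional[Op]]

# Each action is a priority rule that picks one operation from the queue.
RULES: Dict[str, Rule] = {
    "FIFO": lambda q: min(q, key=lambda o: o.arrival),    # first in, first out
    "SPT": lambda q: min(q, key=lambda o: o.proc_time),   # shortest processing time
    "LPT": lambda q: max(q, key=lambda o: o.proc_time),   # longest processing time
    "EDD": lambda q: min(q, key=lambda o: o.due_date),    # earliest due date
    "SNJ": lambda q: None,                                # select no job
}


def apply_rule(name: str, queue: List[Op]) -> Optional[Op]:
    """Return the operation chosen by rule `name`, or None if nothing is selected."""
    if not queue:
        return None
    return RULES[name](queue)


queue = [
    Op("J1", 5.0, 0, 30.0),
    Op("J2", 2.0, 1, 10.0),
    Op("J3", 9.0, 2, 20.0),
]
chosen = {name: apply_rule(name, queue) for name in RULES}
```

Treating each rule as an action keeps the action space small and problem-size independent, which is the reason the agent chooses among rules rather than among individual operations.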

Definition of rewards
The definition of the reward function is closely related to the objective function. The agent is rewarded according to the change of the system state after the implementation of the synthetic behavior and the reward function. The reward function is defined according to the following rules.
1. The immediate reward for each state transition reflects the immediate effect of the action performed, which results in a short-term impact on the scheduling plan.
2. The cumulative total reward reflects the long-term outcome of the execution strategy, denoted as the optimal value of the objective function.
3. The reward function can be applied to scheduling problems of different sizes.
The literature 47 shows a direct relationship between $C_{max}$ and machine utilization (i.e., minimizing the makespan is equivalent to maximizing machine utilization). This study is devoted to minimizing the makespan. The immediate reward earned for each state transition reflects the immediate impact of the action performed.

| State features | Description |
| --- | --- |
| $x_{m,1}$ | Total time of operations to be processed on machine $m$ |
| $x_{m,2} = \sum_{j=1,\,k=1,\,P_{j,k}=0}^{O_{jp}} T_{j,k}$ | Total time of operations processed on machine $m$ |
| $x_{m,3} = J_{m,1}^{P_{j,k}=0}$ | Time of the first operation in the sequence List(m) to be processed on machine $m$ |
| $x_{m,4} = J_{m,2}^{P_{j,k}=1}$ | Time of the second operation in the sequence List(m) to be processed on machine $m$ |
| $x_{m,5} = W_{j,1}^{P_{j,k}=1}$ | Among all future operations, the time of the first operation in the sequence List(m) to be processed on machine $m$ |
| $x_{m,6} = W_{j,2}^{P_{j,k}=0}$ | Among all future operations, the time of the second operation in the sequence List(m) to be processed on machine $m$ |
| $x_{m,7} = \sum_{i=1}^{n} (1 - P_{j,k})$ | Total number of operations for all future processes on machine $m$ |
| $x_{m,8}$ | Total time for all future operations on machine $m$ |
| $x_{m,9}$ | Total number of all jobs' assembly constraint levels on machine $m$ |

The immediate reward also represents the short-term impact of the action on the scheduling scheme. Cumulative rewards reflect the long-term effects, and maximizing them is the goal of RL.
where $U_{ave}(t)$ is the average machine utilization rate. Let $C_{max}(t)$ denote the completion time of the last operation assigned on machine $m$ at scheduling point $t$, and let $O_t$ be the current number of operations of job $i$ that have been assigned. Define the utilization rate of machine $m$ as $U_k(t)$; the cumulative reward $R$ can then be calculated as follows:

Proof
where $k$ is the counter for the allocation operation, which can be considered a discrete time step in RL; the sum $\sum_{k=1} t_{jkm}\, x_{jkm}$ runs over these steps; and $U_k(t)$ and $C_{max}(t)$ are the machine utilization and makespan at time step $k$.

Proposed methods for the FDSSP Related work of RL
RL is a specific class of machine learning (ML) problems that can achieve global optimality 48-50 . In an RL model 51 , the decision-maker chooses an appropriate action by observing the environment and is rewarded for doing so. RL algorithms do not need to know many states or the state transition probability matrix during iterations. RL can be cast as finding the optimal solution of a Markov decision model, and it is mainly used to solve sequential decision problems. The most important feature of RL is that there is no correct answer in the learning process; rather, learning is done through reward signals.

Markov decision processes (MDP)
The Markov property means that the next state $s_{t+1}$ in an RL system depends only on the current state $s_t$. The Markov decision process is described by a 5-tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is a finite set of states, characterizing the description of the environmental state; $A$ is a finite set of actions, representing the behaviors that can be taken; $P$ is the state transition probability function; $R$ is the reward function; and $\gamma$ is the discount factor.
The objective of RL is to enable the agent to find, through continuous experimentation in the environment, an optimal strategy $\pi^*$ that maximizes the expected cumulative reward obtained by following the strategy from any state. The reward function is determined by further defining the value function. The state-value function $v_\pi(s)$ and the action-value function $q_\pi(s, a)$ under the strategy $\pi$ are defined in terms of this expected cumulative reward.
Updating Bellman's expectation equation with the optimal strategy yields the Bellman optimality equation.
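For reference, the textbook forms of the value functions and of the Bellman optimality equations are given below. They are the standard definitions from the RL literature, written with the symbols of the 5-tuple above, and are offered as an assumption about what the lost equations express rather than as a transcription of the paper's own numbered equations. Here $P(s' \mid s, a)$ and $R(s, a, s')$ denote the transition probability and the reward for moving from $s$ to $s'$ under action $a$.

```latex
\begin{align}
v_\pi(s)   &= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right], \\
q_\pi(s,a) &= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right], \\
v_*(s)     &= \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,\bigl[R(s,a,s') + \gamma\, v_*(s')\bigr], \\
q_*(s,a)   &= \sum_{s' \in S} P(s' \mid s, a)\,\bigl[R(s,a,s') + \gamma \max_{a' \in A} q_*(s',a')\bigr].
\end{align}
```

The third line is the form that matters for a state-value-based method: the best action in $s$ can be found from $v_*$ alone by a one-step lookahead, without storing a separate value for every action.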

Temporal difference algorithm
The TD algorithm 52 , combining Monte Carlo and dynamic programming methods, uses the classical Bellman formula to iterate until the value function converges. The basic iteration formula is as follows:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

where $r_{t+1} + \gamma V(s_{t+1})$ is the TD target; $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the TD error; and $\alpha$ is the learning rate. The procedure to calculate $v(s)$ is given in Algorithm 1.
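The TD iteration can be illustrated with a tabular sketch. This is a minimal TD(0) evaluation on a made-up two-state chain, not the deep network version used in the paper; the function name, the episode format, and the toy chain are assumptions for illustration.

```python
def td0_evaluate(episodes, alpha=0.1, gamma=0.95):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    Each episode is a list of (state, reward, next_state) transitions;
    next_state is None when the episode terminates.
    """
    V = {}
    for episode in episodes:
        for s, r, s_next in episode:
            v_s = V.get(s, 0.0)
            v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
            td_target = r + gamma * v_next   # TD target
            td_error = td_target - v_s       # TD error
            V[s] = v_s + alpha * td_error    # move V(s) toward the target
    return V


# Toy chain: A -> B -> terminal, with reward 1 only on the last step.
chain = [("A", 0.0, "B"), ("B", 1.0, None)]
V = td0_evaluate([chain] * 500)
```

After enough passes the estimates approach the true values of this chain, V(B) = 1 and V(A) = γ · V(B) = 0.95, which shows how the reward at the end of an episode propagates backwards through bootstrapped targets.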

Deep neural network
Deep learning 53 is a type of representation learning that is based on artificial neural networks. Deep neural network structures have greater capacity and exponential representation space, which makes it easier to learn and represent a variety of features with a significantly reduced number of neurons.
The recent success of deep learning relies heavily on massive amounts of training data, flexible models, sufficient computing power, and prior experience to combat the curse of dimensionality. Hinton 54 proposed a technique combining pre-training and fine-tuning to drastically reduce the time needed to train a multi-layer neural network. Various optimization techniques have emerged to further alleviate the vanishing gradient problem. In particular, the technique known as deep residual learning 55 enables networks with more than a hundred layers.
Algorithm 1
1: Input: Initialize playback memory $D$ to capacity $N$
2: Initialize the state value function $V$ with weights $\theta$
3: Initialize the target state value function $\hat{V}$ with weights $\theta^{-} = \theta$
4: For episode = 1, $M$ do
5: Initialize sequence $s_1 = \{x_1\}$ and pre-processed sequence $\phi_1 = \phi(s_1)$
6: For $t = 1, T$ do
7: With probability $\varepsilon$ or Eq. (7), select a random action $a_t$
8: Otherwise select $a_t = \arg\max_a V(\phi(s_t), a; \theta)$
9: Execute action $a_t$ in the emulator and observe the reward $r_t$ and image $x_{t+1}$
10: Set $s_{t+1} = s_t, a_t, x_{t+1}$ and pre-process $\phi_{t+1} = \phi(s_{t+1})$
11: Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
12: Sample a random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
13: Set:

Activation function
The activation function, a central unit in the design of neural networks, gives neurons the ability to learn and adapt 56,57 . It introduces nonlinearity into the neural network to address the limited expressive ability of linear models. If no activation function is used, the output of each layer is a linear function of the inputs of the previous layer, so no matter how many layers the neural network has, the output is a linear combination of the inputs. Common activation functions include step functions, Sigmoid functions, Tanh functions, and approximate biological neuronal activation functions such as ReLU, Leaky ReLU, and Softplus. Because approximate biological neuronal activation functions perform better than traditional functions in most network applications, ReLU is used in this paper.
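The claim that stacked layers without an activation collapse into a single linear map can be checked numerically. The sketch below uses small hand-picked weight matrices (arbitrary values chosen for illustration, with biases omitted) and plain Python lists so that it has no dependencies.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Apply ReLU element-wise."""
    return [[max(0.0, v) for v in row] for row in m]


# Arbitrary weights of two layers and a column-vector input.
W1 = [[1.0, -1.0], [0.5, 2.0]]
W2 = [[1.0, 1.0], [-1.0, 0.0]]
x = [[1.0], [2.0]]

# Two linear layers applied one after the other ...
two_linear_layers = matmul(W2, matmul(W1, x))
# ... equal one linear layer whose weight matrix is W2 @ W1.
collapsed = matmul(matmul(W2, W1), x)
# Inserting ReLU between the layers breaks this equivalence.
with_relu = matmul(W2, relu(matmul(W1, x)))
```

Here `two_linear_layers` and `collapsed` are identical, while `with_relu` differs because the negative pre-activation of the first hidden unit is clipped to zero.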

Optimization function
The optimization function 58 , one of the core problems in neural network training, not only speeds up the solution process but also reduces the influence of hyperparameters on it. Common optimization algorithms used in research applications are stochastic gradient descent (SGD), the adaptive gradient algorithm (AdaGrad), root mean square propagation (RMSProp), and Adam.
In this paper, the deep neural network is made up of seven connected layers: one input layer, five hidden layers, and one output layer. Figure 2b gives the structure of the neural network.

Exploration and exploitation
FDSSP can be classified as a multi-stage decision-making problem with terminal states. To balance exploration and exploitation of the agent in environmental interactions, an ε-greedy strategy is used for selecting behaviors. The ε-greedy strategy selects the greedy behavior with probability 1 − ε ( 0 < ε < 1 ) and randomly selects any optional behavior with probability ε, where ε is the exploration factor. Let $P(s, a)$ denote the probability of selecting behavior $a$ in decision state $s$, where $A(s)$ is the set of candidate combinatorial behaviors in state $s$; $|A(s)|$ is the number of behaviors that can be chosen in state $s$; and $a^*(s)$ is the greedy behavior in state $s$, given by Eq. (7), where $r^{a}_{ss'}$ is the immediate reward for taking the combined action $a$ from state $s$ to state $s'$.
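A minimal sketch of ε-greedy selection driven by a state-value function follows. It assumes, as is usual for state-value methods, that the greedy behavior maximizes the one-step lookahead $r^{a}_{ss'} + \gamma V(s')$, and that the random branch draws uniformly from all candidate behaviors, the greedy one included; both are assumptions, since the paper's Eq. (7) and the expression for $P(s, a)$ are not recoverable here. All function and variable names are hypothetical.

```python
import random


def greedy_action(actions, reward, next_value, gamma=0.95):
    """Pick the action maximizing r(s, a, s') + gamma * V(s') (assumed greedy rule)."""
    return max(actions, key=lambda a: reward[a] + gamma * next_value[a])


def epsilon_greedy(actions, greedy, epsilon, rng=random):
    """Explore uniformly with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy


def selection_probabilities(actions, greedy, epsilon):
    """P(s, a) implied by epsilon_greedy above."""
    share = epsilon / len(actions)  # uniform exploration mass per action
    return {a: (1.0 - epsilon + share if a == greedy else share) for a in actions}


# Toy state with three candidate behaviors.
actions = ["a1", "a2", "a3"]
reward = {"a1": -3.0, "a2": -1.0, "a3": -2.0}      # immediate rewards
next_value = {"a1": 5.0, "a2": 2.0, "a3": 4.0}     # V(s') after each action
best = greedy_action(actions, reward, next_value)
probs = selection_probabilities(actions, best, epsilon=0.1)
```

With these toy numbers the lookahead values are 1.75, 0.9, and 1.8, so `a3` is greedy even though `a2` has the best immediate reward.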

Deep temporal difference network model
To briefly describe the implementation process, a workshop visualization ( m = 3, n = 3 ) is given in Fig. 2. The hexagons represent the jobs. The hexahedra represent waiting queues of sufficient capacity. At the start of processing, the scheduling system is in the initial state $S_0$, i.e., all jobs are in the first waiting queue $Q_1$ and all machines are free. Then the first machine selects an action a(k) ( 1 < k < 8 ): a job in the queue $Q_1$ is selected for processing, while the other machines select the action a(8). Whenever any machine completes an operation, the system moves to a new state $S_t$. In this state, each machine selects an action to perform. When another operation is completed, the system moves to a new state $S_{t+1}$, which gives the agent a reward $r_{t+1}$; $r_{t+1}$ can be calculated from the time interval between the two states. Since each machine simultaneously selects one action to execute at each decision moment, the system in fact implements, in state $S_t$, a multidimensional behavior $a_{t+1} = (a_1, a_2, \dots, a_m)$ that combines $m$ sub-actions. When the system reaches the termination state $S_T$, all queues are empty and all jobs have been processed. Hence, a scheduling plan is obtained.
The output layer of the deep Q-network (DQN) uses several nodes to represent a finite number of discrete action values, so it cannot cover the exponentially large multidimensional action space. Moreover, when Q-learning evaluates action values online with a heterogeneous (off-policy) strategy, the optimal value replaces the actual interaction value, which results in over-estimation. Hence, the method cannot be applied directly to the multi-dimensional action space problem. Temporal difference learning with the same (on-policy) strategy, in which action values are calculated indirectly from state values, is therefore proposed to replace Q-learning; it is suitable for selecting multi-dimensional actions, as given in Algorithm 2.
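The size of the combined action space, and why it stays manageable in practice, can be sketched as follows. The code enumerates the combined behaviors as the Cartesian product of each machine's feasible sub-actions, assuming that a machine that is busy or has an empty queue can only choose a(8) (select no job) and that a free machine with waiting jobs may choose any of the eight behaviors. The dictionary keys and function names are hypothetical.

```python
from itertools import product

SNJ = 8  # a(8): select no job


def feasible_sub_actions(queue_empty, busy, n_rules=8):
    """Sub-actions available to one machine (assumed feasibility rule)."""
    if queue_empty or busy:
        return [SNJ]
    return list(range(1, n_rules + 1))


def combined_actions(machines):
    """All combined behaviors (a_1, ..., a_m) for the current state."""
    per_machine = [feasible_sub_actions(m["queue_empty"], m["busy"]) for m in machines]
    return list(product(*per_machine))


def select_combined_action(actions, lookahead_value):
    """Greedy choice: evaluate each combined action by a one-step lookahead."""
    return max(actions, key=lookahead_value)


# Three machines: only the first is free with a non-empty queue.
state = [
    {"queue_empty": False, "busy": False},
    {"queue_empty": True, "busy": False},
    {"queue_empty": False, "busy": True},
]
candidates = combined_actions(state)
```

With three unconstrained machines there would be 8³ = 512 combined behaviors, but in the state above only 8 remain, each evaluated through the state value of its successor rather than through a dedicated output node.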

Experiment study
To evaluate the validity of the proposed algorithm, experiments were conducted on different test cases in four parts. First, using the standard test set established by Kacem 59 , we use eight small-scale cases to compare with other algorithms 17,32,60,61 in the "Small scale FDSSP" section. Then, in the "Comparisons with the proposed dispatching rules" section, we compare the proposed DTDN algorithm with the Q-Learning (QL) algorithm (Jiménez 62 ) and the deep deterministic policy gradient (DDPG) algorithm (Liu 63 ) on the Brandimarte 64 cases. Moreover, for large-scale instances, we designed our own test cases, which include 30 FDSSP problems of varying complexity, as shown in the "Large-scale instances of FDSSP" section. Finally, in the "Case study: production scheduling problem" section, we illustrate in detail the application of our algorithm in a case study of the hydraulic cylinder production scheduling problem.
The DTDN algorithm is coded in Python 3.7 in JetBrains PyCharm Community Edition 2019.2.1 x64 and runs on an Intel Core i9-10900X @ 3.7 GHz CPU with 16 GB RAM. First, we build the FDSSP environment classes, machine classes, and job classes in an object-oriented manner on the RL platform OpenAI Gym. Gym specifies the main member methods of environment classes as a framework, including init, reset, step, render, and close. Then, an agent that executes the algorithm iterates interactively with the environment. The deep neural network model of the agent is implemented with the TensorFlow back end. The experimental data are shown in Table 5.
The selection of parameters may affect the quality of the solution, so general principles are followed. The discount factor γ measures the weight of the subsequent state value in the total return, which is why it generally takes a value close to 1 (here, γ = 0.95). To facilitate full exploration of the strategy space during the initial phase of the iteration, the ε-greedy strategy sets the initial value ε = 1, which then decays at a rate of 0.995. The learning rate is $\alpha = 5 \times 10^{-4}$; the maximum number of interactions is MAX_EPISODE = 800; the capacity of memory D is N = 6000; and the sample batch size is BATCH_SIZE = 64. The deep neural network of the agent is shown in Fig. 2b, in which the network parameters adopt a random initialization strategy.
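The parameter settings above can be collected in a short configuration sketch, together with a bounded replay memory and the ε schedule. The constant values come from the text; the assumption that ε decays once per episode, the use of `collections.deque` for the memory, and the helper names are illustrative choices, not details confirmed by the paper.

```python
import random
from collections import deque

GAMMA = 0.95            # discount factor
EPSILON_START = 1.0     # initial exploration factor
EPSILON_DECAY = 0.995   # multiplicative decay (assumed: applied once per episode)
LEARNING_RATE = 5e-4    # alpha
MAX_EPISODE = 800       # maximum number of interactions
MEMORY_CAPACITY = 6000  # capacity N of memory D
BATCH_SIZE = 64         # sample batch size

# Replay memory D: the oldest transitions are dropped once capacity is reached.
memory = deque(maxlen=MEMORY_CAPACITY)


def store(transition):
    """Append one transition (state, action, reward, next_state) to D."""
    memory.append(transition)


def sample(rng=random):
    """Draw a random minibatch without replacement from D."""
    return rng.sample(list(memory), min(BATCH_SIZE, len(memory)))


epsilon = EPSILON_START
for episode in range(MAX_EPISODE):
    # ... interact with the environment and train on sample() here ...
    epsilon *= EPSILON_DECAY
```

Under the per-episode assumption, ε falls from 1 to roughly 0.018 after 800 episodes, so the agent explores almost uniformly at first and acts almost greedily at the end.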
Performance metrics: the relative percentage deviation (RPD) and the average relative percentage deviation (ARPD) are used, where $C_{max}$ is the best result obtained by an algorithm and LB is the optimal result of the Branch and Bound algorithm. LB represents the ideal solution and may not be achievable.
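The two metrics can be computed as in the sketch below, which takes RPD as the deviation of the obtained makespan from the lower bound relative to the lower bound, and ARPD as the mean RPD over a set of cases. Expressing the values in percent and averaging with equal weights are assumptions, and the makespans in the example are invented for illustration.

```python
def rpd(c_max, lb):
    """Relative percentage deviation of a makespan from the lower bound, in percent."""
    return (c_max - lb) / lb * 100.0


def arpd(c_max_values, lb_values):
    """Average RPD over several cases (assumed: unweighted mean)."""
    rpds = [rpd(c, lb) for c, lb in zip(c_max_values, lb_values)]
    return sum(rpds) / len(rpds)


# Invented example: two cases with lower bounds 10 and 40.
example_rpds = [rpd(11, 10), rpd(42, 40)]
example_arpd = arpd([11, 42], [10, 40])
```

In this example the two cases deviate by 10% and 5% from their lower bounds, giving an ARPD of 7.5%; an RPD of 0 means the algorithm reached the lower bound.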

Small scale FDSSP
To prove the validity of the solution process in this study, the cases proposed by Kacem are validated, where the number of jobs (n), the number of machines (m), and each operation of jobs ( O ij ) are represented. For example, n × m denotes a case consisting of n jobs and m machines. The literature with the same case study as this paper is selected for comparison [Zhang 17 (DACS); Xing et al. 60 (SM); Moslehi 32 (PSO); Li et al. 61 (HTSA)], which ensures the credibility of the comparison results. Meanwhile, each case is run ten times to obtain the combined optimal solution. CPU times of the various algorithms are calculated by the "relative ratio" downloaded from https://www.cpubenchmark.net/ (Table 6). The results show that the optimal solutions of DTDN and the other algorithms are the same as LB, but the CPU running times differ significantly for the small-scale problems (Kacem) in Table 7.

$$\mathrm{RPD} = \frac{C_{max} - LB}{LB}$$

8a, it can be seen that the algorithm progressively generates an optimal production schedule (35). An optimal policy set {π * } = {(6, 8, 8, 8), (2, 1, 8, 8), . . ., (8, 8, 8, 1)} is the operation sequence of each job on the machine. Each tuple in the policy set lists the behavior numbers adopted by the four machines in the corresponding state, i.e., their combined action. At each decision time point, since most of the machine waiting queues are empty or in-process, their feasible action space includes only a(8), which saves computation time. Moreover, the comparison of test results for Problem 2 (Table 8b) shows that the optimal solution of the DTDN algorithm (426) is improved by 4.3% and 2.7% compared to the Nawaz-Enscore-Ham (NEH) algorithm (445) and the NEH-KK algorithm (438), respectively.

Comparisons with the proposed dispatching rules
To verify the efficiency and generality of the proposed DTDN, we adopted the Brandimarte 64 data set. The scores and RPD of the seven dispatching rules on each data case are tallied. As shown in Table 9, the proposed DTDN algorithm obtains better solutions than the other algorithms, and some of them are already below the upper bound of the original cases. The actions that are used more than 10% of the time are FIFO, SPT, LPT, SRRT, LRPT, MOR, and EDD (Fig. 3). These actions contribute more to obtaining the optimal solution and thus have a greater utilization value. The frequency distribution of the other actions was relatively even, but their performance was not obvious. Therefore, other heuristic behaviors could be added to the candidate action space, and some underutilized behaviors could be eliminated to streamline the actions.

Large-scale instances of FDSSP
To study the performance of DTDN on large-scale problems, the results on the Brandimarte cases are extended with the BC, DP, and BR data cases of varying problem sizes, for a total of 30 data cases. The solution results of the proposed DTDN are compared with Gao et al. 65 (HGA), Mastrolilli and Gambardella 66 (MG), Sun et al. 67 (HMEA), Chen et al. 69 (SLGA), and Reddy 68 [teaching-learning based optimization (TLBO)], which are shown in Table 10 and Fig. 4. The test results show that the proposed DTDN algorithm can find better computational results globally through a large amount of trial and error in the solution procedure. The obtained performance index results are better than those of traditional optimization methods for cases of different scales, demonstrating the validity of the DTDN algorithm for FDSSP.

Case study: production scheduling problem
Nourali 8,9 proposed a useful benchmark of FDSSP, including 40 different data cases. The solution results of the proposed DTDN are compared with Huang et al. 69 [particle swarm optimization (PSO)], Zhang and Wong 7 [constraint programming (CP)], and Zhang et al. 17 [distributed ant colony system (DACS)]. The results are displayed in Table 11 and Fig. 4, and the following conclusions can be summarized. The optimal solutions of this algorithm are all within [LB, UB], indicating that the solutions are valid. The performance of DTDN is close to that of the other three algorithms. The CPU run time is about as long as that of the other algorithms for small-scale cases, but DTDN is much more efficient than the other methods on large-scale problems. Lastly, in general, this algorithm is slightly less capable of solving large-scale problems, because the scheduling state space is large, the learning error is large when the same network structure is used, and more iterations and a more optimized network structure are needed to reduce the training error.

Conclusion
The main contribution of this paper is to propose an efficient DTDN method for FDSSP in a flexible shop production environment to minimize makespan. The Q-learning in the deep reinforcement learning algorithm DQN is transformed into temporal difference (TD) learning with state values. Hence, the deep temporal difference reinforcement learning algorithm is obtained, which is successfully applied to the shop scheduling problem. As shown by experiments, the algorithm can obtain a better solution in a smaller number of iterations compared to simple constructive heuristics or population intelligence algorithms. Because of the introduction of state features, heuristic behaviors, and deep neural networks, the algorithm is highly flexible and dynamic. The advantages of the proposed algorithm are as follows:
1. The algorithm can learn and run in real time. The selection from the input state to the neural network is made by the SCH algorithm with basic rules. Once the neural network is successfully trained, the previous empirical patterns are stored in the network parameters, which can make scheduling decisions in real time.
2. The algorithm model is more flexible. The state features, behavior rules, and neural network size can be flexibly modified as needed. The constructive process is closer to actual scheduling, which is not only applicable to NPFS problems with greater computational complexity but, in principle, also suitable for solving dynamic scheduling problems.

Figure 2. DTDN algorithm running model (3 × 3): Deep neural network model of state perception in agent: (a) Deep neural network model of state perception; (b) Deep neural network structure.

Figure 4. Box plots based on the results in Tables 10 and 11.

Table 1. Summary of relevant RL methods.

Table 2. Model parameters and indices.
Sorted by the due date from shortest to longest
Apparent tardiness cost (ATC): sorted by the tardiness cost from shortest to longest
Total least operations remaining (TLOPR): sorted by assembly-related constraints of the job from shortest to longest
Select no job (SNJ): the machine does not select any job for processing
Scientific Reports | (2024) 14:9047 | https://doi.org/10.1038/s41598-024-59414-8

Table 6. Relative ratio of different computers in studies.

Table 7. Results comparison of makespan and CPU time in different methods on Kacem's test cases.

Table 9. Results comparison of scheduling score and RPD in different methods on Orb data cases.

Performance comparison of action space (dispatching rules) under different data cases.

Table 10. Results comparison of scheduling score and RPD in different methods on Orb data cases.

Due to the shortcomings of the study, further work can be considered in the following aspects.
1. Scheduling model. Significantly, RL can add and subtract state features to better describe the processing state with minimal redundancy. More efficient and practical heuristic behaviors, and value function approximators with stronger fitting and generalization ability, can be sought. What's more, the candidate behavior set can be adjusted to include more highly utilized constructive heuristic behaviors.
2. Algorithm procedure. Many kinds of improvements can be made to the DTDN algorithm itself. For example, a DTDN algorithm with prioritized playback memory for memory sampling can improve the efficiency of algorithm iteration.
3. Algorithm application. There is a large development space for the algorithm with the continuous progress of deep neural network theory and increasing computing power. The algorithm can be extended to more complex job shop scheduling problems and other dynamic scheduling problems.

Table 11. Results comparison of makespan and RPD in extended Nourali's test cases.